ABSTRACT

Among the computer-based methods used for the
construction of trees such as AID, THAID, CART, and FACT, the only
one that uses an algorithm that first grows a tree and then prunes
the tree is CART. The pruning component of CART is analogous in
spirit to the backward elimination approach in regression analysis.
This idea provides a tool for controlling the tree size to some
extent and thus estimating the prediction error of the tree within a
certain range of tree sizes. In the CART pruning process, Breiman,
Friedman, Olshen, and Stone (1984) use a linear combination of the
expected loss of the decisions by the tree and the total number of
the terminal nodes of the tree. In this paper, CART's pruning is
extended by considering a function of all the nodes of the tree in
addition to the factors involved in the linear combination. Such an
extension arises, for example, when one considers the cost of
observing a variable at each node, which is the main concern of this
paper, or the structural complexity of the tree. (Contains two
figures and six references.) (Author)
Abstract

Among the computer-based methods used for the construction of trees such as AID,
THAID, CART and FACT, the only one that uses an algorithm that first grows a tree
and then prunes the tree is CART. The pruning component of CART is analogous
in spirit to the backward elimination approach in regression analysis. This idea
provides a tool for controlling the tree size to some extent and thus estimating the
prediction error of the tree within a certain range of tree sizes. In the CART pruning
process, Breiman, Friedman, Olshen, and Stone (1984) use a linear combination of
the expected loss of the decisions by the tree and the total number of the terminal
nodes of the tree. In this paper, CART's pruning is extended by considering a
function of all the nodes of the tree in addition to the factors involved in the linear
combination. Such an extension arises, for example, if we consider the cost of
observing a variable at each node, which is the main concern of this paper, or the
structural complexity of the tree.

Key Words: decision-support tree; optimal pruning; the smallest optimally
pruned subtree; sufficient tree.
1 Introduction and Motivation

Consider a sequential decision making problem where observations are made
sequentially depending upon the outcome of the previous observation, and after each
observation a decision is to be made on whether to continue observation or to stop
observing and make a final decision about the dependent (or response) variable. If
we depict this sequential process from the first observation of a random variable
through to the final decisions in a graph, we will end up with a tree-like structure,
under the condition that the observations are made on categorical variables only.
We will call such a graph a tree. In graph terminology, we define a tree as a
connected, directed and acyclic graph where there is only one path from one vertex
to another, and the direction indicates the sequence of observations. The graph (a)
of Figure 2.1 in Section 2 is an illustration of a tree, where observations are made at
the circles and a box symbolizes a final decision. We will call the circles the nodes,
and the boxes the terminal nodes.

Trees are among the data analysis tools (factor analysis, nonparametric scaling,
and so forth) that have been proposed by social and biomedical scientists motivated
by the need to cope with actual data problems involving large numbers of variables.
In particular, Breiman, Friedman, Olshen, and Stone (1984) note that the
tree-structured methods are very competent in finding a classification rule when the
complexity of a data set includes aspects such as high dimensionality, a mixture of
data types (e.g., quantitative and qualitative, or from different stochastic models),
variation of dimensions over elements in the data set, or nonhomogeneity.

The use of trees in regression analysis dates back to the Automatic Interaction
Detection program (AID) developed at the Institute for Social Research, University
of Michigan, by Morgan and Sonquist (1963), which was followed by the
classification program THAID, developed by Morgan and Messenger (1973). Breiman et
al. (1984) proposed an algorithm, which they called Classification And Regression
Trees, that is designed as a sequential decision aid for classification or regression
problems. Given appropriate data, CART provides a guide, in the form of an
upside-down tree, for the order in which to observe predictor variables, when to stop
observation, and what decision to make about a yet-unknown outcome of interest.
The computer program that is based on this algorithm is referred to as CART. Loh
and Vanichsetakul (1988) subsequently proposed an algorithm called Fast Algorithm
for Classification Trees which involves recursive application of linear discriminant
analysis, with the predictor variables at each stage being appropriately chosen
according to the data and the type of splits desired. The computer program based on
the algorithm is called FACT.

The algorithms that underlie AID, THAID and FACT grow a tree by adding
branches (variables) as long as a particular condition holds. In contrast, the CART
algorithm constructs a tree in two steps: first growing it and then pruning it. In
general terms, CART uses a loss function in the growing process, which ends when
the expected loss no longer decreases (i.e., remains the same). Then, in the CART
pruning process, the number of the terminal nodes of a tree is considered in addition
to the loss function, and the process ends when a linear combination of the number
of terminal nodes and the expected loss of the tree is minimized. A tree constructed
in this manner has several desirable properties. Since we use a single loss function
as a criterion in the growing process, we can read from the tree which predictor
variable is more informative (conditional on some other predictor variables) based
on the loss function. The grow-then-prune approach avoids trees that are too small
or too large compared with the trees constructed by a top-down stopping method
(see Section 3.1 of Breiman et al. (1984) and Breiman and Friedman (1988)).

CART's pruning is analogous to backward elimination in regression analysis.
As the latter is proposed as a remedy for stopping too early in the regression model
searching process, so is the former as a remedy for stopping with too small a tree.
In CART's pruning, we consider the number of the terminal nodes as a complexity
penalty of the tree. By specifying the penalty rate, we can control the number of
the terminal nodes within a certain range. In other words, the pruning reduces the
tree to a certain range of tree sizes.

As mentioned above, CART's pruning deals with the terminal nodes only. A
motivation for an extension of CART's pruning is that we may expand our attention
from the terminal nodes to all the nodes of a tree in the pruning process. In this
paper, we will consider the observation cost of the variables at the nodes along
with the tree size in the pruning process. Here, we can expect an argument against
the idea of considering the observation cost in the pruning process rather than the
growing process. But if we include the observation cost in the growing process,
the tree will be less informative about the data structure of concern compared with
the tree grown using a single loss function. The information would be blurred by
adding an extraneous factor to the loss function. In this paper, we will derive some
results that are useful in developing an optimal pruning algorithm using a linear
combination of the loss function, the number of the terminal nodes, and a function
of non-terminal nodes.

The remainder of this paper is organized as follows. In Section 2, we specify
the basic notation and definitions concerning trees. In Section 3, we discuss the
basic properties of the function used in pruning. Section 4 is the main part of the
paper. In it we derive an optimal pruning algorithm under our extended situation.
Section 5 gives a summary and a brief comment on a possible application of the idea
behind the methods of Section 4 to other pruning criteria.
2 Notation

We borrow most of the notation used here from Breiman et al. (1984). Fix
a tree $\tau$. We let $\tilde{\tau}$ be the set of all the terminal nodes of tree $\tau$ and $N(\tau)$ the set
of all the non-terminal nodes of tree $\tau$. For a set $A$, we let $|A|$ be the number of
elements in $A$. A subtree of a tree $\tau$ is a tree as a part of $\tau$. For $t \in N(\tau)$, we denote
by $\tau_t$ the subtree of $\tau$ whose root node is $t$ and whose terminal nodes are those of
$\tau$ that follow from the node $t$. For notational convenience, when a tree notation
has a subscript as in $\tau_{(\cdot)}$, say, we write $\tau_{(\cdot)t}$ for a subtree of $\tau_{(\cdot)}$ whose root node is
$t \in N(\tau_{(\cdot)})$. For a non-trivial tree $\tau$ and a non-terminal node $t$ of $\tau$, we denote by
$\tau \setminus \tau_t$ the pruned subtree of $\tau$ which is obtained by cutting $\tau_t$ off of $\tau$ while leaving
the node $t$ on the tree.

{Figure 2.1 about here.} 

If we denote by $L$ the loss function which compares a prediction made from the
tree and the corresponding outcome of interest, then the conditional expected loss
of the prediction for the outcome given the results of the predictor variables, $X_1 = 1$
and $X_2 = 1$, say, is given by

$$E(L \mid X_1 = 1, X_2 = 1).$$

Without loss of generality, we may assume that the predictor variables are finitely
discrete. We denote by $r(t)$ the conditional expected loss at the node $t$ (i.e., when
we condition on the event described by the predictors up to that node) and we let

$$R(t) = P(t)\, r(t),$$

where $P(t)$ is the arrival rate at node $t$ in the tree $\tau$ (i.e., the probability that the
tree sends a subject from the root node to the node $t$). Then, the risk in making
predictions from the tree $\tau$ is

$$R(\tau) = \sum_{t \in \tilde{\tau}} R(t). \qquad (2.1)$$

If $t$ is not the root node of a non-trivial tree $\tau$, then we denote

$$R(\tau_t) = \sum_{s \in \tilde{\tau}_t} R(s), \qquad (2.2)$$

where the $R(s)$'s are obtained based on the tree $\tau$.

3 Preliminaries 

As we indicated at the end of Section 1, we wish to consider the cost or the
time of the variable observation at each non-terminal node, and we denote this cost
at the node $t$ by $W_t$ ($W$ for weight). We assume $W_t > 0$ for a non-terminal node
$t$, and $W_t = 0$ when $t$ is a terminal node since no observation is made there. We
define a cost function for the pruning process which involves $W_t$ and investigate it
in this section.

We denote by $par(t)$ the node which immediately precedes the node $t$. For a
node $t$ of a tree $\tau$ which is not the root node, we let

$$W(t) = \sum_{s} W_s,$$

where the summation goes over the set of all the nodes on the path from the root
node through $par(t)$. When $t$ is the root node, we let $W(t) = 0$. Then, we have the
following result.

Theorem 3.1 For any tree $\tau$,

(a) $$\sum_{t \in \tilde{\tau}} P(t)\, W(t) = \sum_{t \in N(\tau)} P(t)\, W_t.$$

(b) For a node $t$ of $\tau$,

$$\sum_{s \in \tilde{\tau}_t} P(s)\, W(s) = \sum_{s \in N(\tau_t)} P(s)\, W_s + P(t)\, W(t).$$

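Proof: The proof of (a) is by induction. If $\tau$ is trivial, we have

$$W(t) = 0 = W_t, \quad \text{for } t \in \tilde{\tau} = \{t\}.$$

Suppose that the result holds for $\tau = \tau'$ and that $\tau'$ is branched at a terminal node
$t'$ into a new tree $\tau''$ such that $\tilde{\tau}'' = (\tilde{\tau}' \setminus \{t'\}) \cup \{t_1, t_2, \ldots, t_k\}$. Then,

$$\sum_{t \in \tilde{\tau}''} P(t) W(t) = \sum_{t \in \tilde{\tau}'} P(t) W(t) - P(t') W(t') + \sum_{i=1}^{k} P(t_i) W(t_i)$$
$$= \sum_{t \in N(\tau')} P(t) W_t - P(t') W(t') + \sum_{i=1}^{k} P(t_i) W(t_i),$$

where the second equality follows from the supposition. The last term on the right
hand side of the last equation is equal to

$$\sum_{i=1}^{k} P(t_i) W(t_i) = P(t')\,(W(t') + W_{t'}),$$

since $W(t_i) = W(t') + W_{t'}$ for every child $t_i$ of $t'$ and the arrival rates of the
children sum to $P(t')$. Thus, we have

$$\sum_{t \in \tilde{\tau}''} P(t) W(t) = \sum_{t \in N(\tau')} P(t) W_t + P(t') W_{t'} = \sum_{t \in N(\tau'')} P(t) W_t.$$

The proof of part (b) is direct:

$$\sum_{s \in \tilde{\tau}_t} P(s) W(s) = \sum_{s \in \tilde{\tau}_t} P(s)\,(W(s) - W(t) + W(t))$$
$$= \sum_{s \in \tilde{\tau}_t} P(s)\,(W(s) - W(t)) + P(t) W(t)$$
$$= \sum_{s \in N(\tau_t)} P(s) W_s + P(t) W(t),$$

where the last equality follows from (a). □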
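
As an informal numerical check of Theorem 3.1 (a), the following Python sketch evaluates both sides of the identity on a small tree. The tree shape, the arrival rates, and the costs are made-up illustration values, and the names `children`, `P`, `w`, and `path_costs` are hypothetical helpers, not notation from the paper.

```python
# Hypothetical toy tree: a splits into b and c; b into d, e; c into f, g.
children = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f", "g"]}
# Assumed arrival rates P(t); the children of a node sum to the parent's rate.
P = {"a": 1.0, "b": 0.6, "c": 0.4, "d": 0.3, "e": 0.3, "f": 0.1, "g": 0.3}
# Assumed observation costs W_t > 0 at non-terminal nodes (0 at terminal nodes).
w = {"a": 1.0, "b": 1.0, "c": 0.5}


def path_costs(root):
    """Return W(t) for every node: the sum of W_s over the nodes s on the
    path from the root through par(t); W(root) = 0."""
    W = {root: 0.0}
    stack = [root]
    while stack:
        t = stack.pop()
        for c in children.get(t, []):
            W[c] = W[t] + w[t]
            stack.append(c)
    return W


W = path_costs("a")
leaves = [t for t in P if t not in children]

lhs = sum(P[t] * W[t] for t in leaves)     # sum over terminal nodes of P(t) W(t)
rhs = sum(P[t] * w[t] for t in children)   # sum over non-terminal nodes of P(t) W_t
print(lhs, rhs)
```

Both sums come out to 1.8 for these numbers (up to floating-point rounding), as the theorem predicts.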

For non-negative real numbers $\beta_i$, $i = 1, 2$, we now define a new risk function for
our pruning process:

$$R_\beta(\tau) = R(\tau) + \beta_1 \sum_{t \in \tilde{\tau}} P(t) W(t) + \beta_2 |\tilde{\tau}|, \qquad (3.1)$$

where $\beta = (\beta_1, \beta_2)$. Thus $R_\beta(\tau)$ is a linear combination of the risk function $R(\tau)$
used in growing the tree, the number of terminal nodes of the tree $|\tilde{\tau}|$, and the
expected value of $W(t)$ for the terminal nodes $t$. For any node $t \in N(\tau) \cup \tilde{\tau}$, let

$$R_\beta(t) = R(t) + \beta_1 P(t) W(t) + \beta_2, \qquad (3.2)$$

and

$$R_\beta(\tau_t) = R(\tau_t) + \beta_1 \sum_{s \in \tilde{\tau}_t} P(s) W(s) + \beta_2 |\tilde{\tau}_t|. \qquad (3.3)$$

The following theorem is straightforward.



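Theorem 3.2 Let $\tau$ be a non-trivial tree. Then, for $t \in N(\tau)$,

(a) $$R_\beta(\tau) - R_\beta(\tau \setminus \tau_t) = R_\beta(\tau_t) - R_\beta(t).$$

(b) For every ancestor $s$ of the node $t$,

$$R_\beta(\tau_s) - R_\beta(\tau_s \setminus \tau_t) = R_\beta(\tau_t) - R_\beta(t).$$

Proof: From expressions (2.1) and (2.2), we see that

$$R(\tau \setminus \tau_t) = R(\tau) - R(\tau_t) + R(t). \qquad (3.4)$$

By definition, we have

$$|\widetilde{\tau \setminus \tau_t}| = |\tilde{\tau}| - |\tilde{\tau}_t| + 1, \qquad (3.5)$$

and

$$R_\beta(\tau \setminus \tau_t) + R_\beta(\tau_t) - R_\beta(t)$$
$$= [R(\tau \setminus \tau_t) + R(\tau_t) - R(t)] + \beta_1 \Big[ \sum_{s \in \widetilde{\tau \setminus \tau_t}} P(s) W(s) + \sum_{s \in \tilde{\tau}_t} P(s) W(s) - P(t) W(t) \Big]$$
$$\quad + \beta_2 \big[ |\widetilde{\tau \setminus \tau_t}| + |\tilde{\tau}_t| - 1 \big]. \qquad (3.6)$$

But for the expression in the second bracket in expression (3.6), we have

$$\sum_{s \in \widetilde{\tau \setminus \tau_t}} P(s) W(s) + \sum_{s \in \tilde{\tau}_t} P(s) W(s) - P(t) W(t)$$
$$= \sum_{s \in N(\tau \setminus \tau_t)} P(s) W_s + \sum_{s \in N(\tau_t)} P(s) W_s + P(t) W(t) - P(t) W(t)$$
$$= \sum_{s \in N(\tau)} P(s) W_s$$
$$= \sum_{s \in \tilde{\tau}} P(s) W(s),$$

where the first and the last equalities follow from Theorem 3.1. Hence, by combining
expressions (3.1), (3.4) and (3.5), we complete the proof of (a).

The proof of (b) proceeds in the same manner. First we note that

$$R_\beta(\tau_s \setminus \tau_t) + R_\beta(\tau_t) - R_\beta(t)$$

is equal to the right-hand side of expression (3.6) with $\tau$ replaced by $\tau_s$. On the
other hand,

$$\sum_{u \in \widetilde{\tau_s \setminus \tau_t}} P(u) W(u) + \sum_{u \in \tilde{\tau}_t} P(u) W(u) - P(t) W(t)$$
$$= \sum_{u \in N(\tau_s \setminus \tau_t)} P(u) W_u + P(s) W(s) + \sum_{u \in N(\tau_t)} P(u) W_u + P(t) W(t) - P(t) W(t)$$
$$= \sum_{u \in N(\tau_s)} P(u) W_u + P(s) W(s).$$

Therefore, the result follows from expressions (3.4) and (3.5) with $\tau$ replaced by $\tau_s$.
□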

The extended loss function of expression (3.1) is rather difficult to handle, and
thus we will develop an alternative version of it below and examine its properties
for use in finding an optimal pruning method in Section 4. Let

$$W(\tau_t) = \sum_{s \in \tilde{\tau}_t} P(s) W(s) - P(t) W(t). \qquad (3.7)$$
For a fixed non-trivial tree $\tau$ and a node $t \in N(\tau)$, we let

$$g_1(t, \tau) = \frac{|\tilde{\tau}_t| - 1}{W(\tau_t)}, \qquad (3.8)$$

and

$$g_2(t, \tau) = \frac{R(t) - R(\tau_t)}{|\tilde{\tau}_t| - 1}. \qquad (3.9)$$

In equations (3.8) and (3.9), we note that, for a non-trivial tree $\tau$,

$$|\tilde{\tau}_t| - 1 \ge 1 \quad \text{and} \quad W(\tau_t) > 0. \qquad (3.10)$$

Since a branching gives rise to at least two new nodes, the first inequality of
expression (3.10) is obvious. Rewriting equation (3.7) gives

$$W(\tau_t) = \sum_{s \in \tilde{\tau}_t} P(s)\,(W(s) - W(t)),$$

where $W(s) - W(t) > 0$ for $s \ne t$ since each individual weight is positive. On the
other hand, we never grow a tree when the $R$-value does not decrease. In other
words, a branching is made at a terminal node $t$ of a current tree, i.e., a simple
tree $\tau'$, whose root node is $t$ and whose terminal nodes are the child nodes of $t$, is
attached to $t$, only when

$$R(t) > R(\tau'). \qquad (3.11)$$

Therefore, we have that, for every non-terminal node $t$ of $\tau$, both $g_1(t, \tau)$ and $g_2(t, \tau)$
are positive.

We now let

$$\Delta_\beta(g_1(t, \tau), g_2(t, \tau)) = g_2(t, \tau) - \beta_2 - \frac{\beta_1}{g_1(t, \tau)}, \qquad (3.12)$$

and we focus on the difference in risk

$$D_\beta(t, \tau) = R_\beta(t) - R_\beta(\tau_t).$$

Then, by expressions (3.2), (3.3) and (3.7), we have

$$\Delta_\beta(g_1(t, \tau), g_2(t, \tau)) = D_\beta(t, \tau) / (|\tilde{\tau}_t| - 1). \qquad (3.13)$$
From equations (3.10) and (3.13), we can see that, for any non-terminal node $t$ of a
non-trivial tree $\tau$,

$$\mathrm{sign}[\Delta_\beta(g_1(t, \tau), g_2(t, \tau))] = \mathrm{sign}[D_\beta(t, \tau)]. \qquad (3.14)$$

Because of equation (3.14), we may use the $\Delta$-function of equation (3.12) in
place of $R_\beta$ to find an optimal pruning method since the increase or decrease of
$R_\beta$, as given by the sign of $D_\beta(t, \tau)$, determines where to prune and when to stop
the pruning process.
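
The relation between the risk difference $D_\beta$ and the per-terminal-node quantity $\Delta_\beta$ can be checked numerically. The sketch below is only an illustration under assumed inputs: the toy tree, its risks `R`, arrival rates `P`, costs `w`, and the penalty rates `beta1`, `beta2` are invented, and the function names are hypothetical. It computes $D_\beta(t, \tau)$ directly from (3.2) and (3.3), and $\Delta_\beta$ as $g_2 - \beta_2 - \beta_1 W(\tau_t)/(|\tilde{\tau}_t| - 1)$.

```python
# Hypothetical toy tree: a splits into b and c; b into d, e; c into f, g.
children = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f", "g"]}
parent = {c: p for p, cs in children.items() for c in cs}
P = {"a": 1.0, "b": 0.6, "c": 0.4, "d": 0.3, "e": 0.3, "f": 0.1, "g": 0.3}
R = {"a": 0.5, "b": 0.25, "c": 0.2, "d": 0.05, "e": 0.1, "f": 0.02, "g": 0.15}
w = {"a": 1.0, "b": 1.0, "c": 0.5}   # assumed costs W_t at non-terminal nodes
beta1, beta2 = 0.05, 0.035           # assumed penalty rates


def path_cost(t):
    """W(t): total cost on the path from the root through par(t)."""
    total = 0.0
    while t in parent:
        t = parent[t]
        total += w[t]
    return total


def leaves_below(t):
    """Terminal nodes of the subtree rooted at t."""
    if t not in children:
        return [t]
    return [x for c in children[t] for x in leaves_below(c)]


def D(t):
    """D_beta(t, tau) = R_beta(t) - R_beta(tau_t), using (3.2) and (3.3)."""
    leaves = leaves_below(t)
    r_node = R[t] + beta1 * P[t] * path_cost(t) + beta2
    r_sub = (sum(R[s] + beta1 * P[s] * path_cost(s) for s in leaves)
             + beta2 * len(leaves))
    return r_node - r_sub


def delta(t):
    """g2 - beta2 - beta1 * W(tau_t) / (number of leaves below t - 1)."""
    leaves = leaves_below(t)
    n = len(leaves) - 1
    g2 = (R[t] - sum(R[s] for s in leaves)) / n
    w_sub = sum(P[s] * path_cost(s) for s in leaves) - P[t] * path_cost(t)  # (3.7)
    return g2 - beta2 - beta1 * w_sub / n


table = {t: (D(t), delta(t), len(leaves_below(t)) - 1) for t in children}
for t, (d, dl, n) in sorted(table.items()):
    print(t, round(d, 6), round(dl, 6), n)
```

For every non-terminal node the first value equals the second multiplied by the third, so the two quantities always share a sign; with these numbers node `b` is worth keeping while the splits at `a` and `c` are not.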

4 Extended Optimal Pruning 

We denote by $\tau' \preceq \tau$ the relationship that $\tau'$ is a pruned subtree of $\tau$, and by
$\tau' \prec \tau$ that $\tau'$ is a strictly pruned subtree of $\tau$, i.e., when $\tau'$ is a pruned subtree of
$\tau$ and $\tau' \ne \tau$. We call $\tau_1$ an optimally pruned subtree (OPST) of a non-trivial tree
$\tau$ with respect to $\beta$ if

$$R_\beta(\tau_1) = \min_{\tau' \preceq \tau} R_\beta(\tau'),$$

and we denote by $\tau(\beta)$ the smallest OPST of $\tau$ with respect to $\beta$.

The following theorem is immediate from the transitivity of the relationship $\preceq$.

Theorem 4.1 If $\tau(\beta) \preceq \tau' \preceq \tau$, then $\tau(\beta) = \tau'(\beta)$.

For notational convenience, we write $\Delta_\beta(t, \tau)$ for $\Delta_\beta(g_1(t, \tau), g_2(t, \tau))$. The
following theorem provides us a convenient algebraic tool to deal with $R_\beta$.

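Theorem 4.2 Let $\tau' \preceq \tau$, where $\tau'$ is not a trivial tree. Suppose $t$ is a non-terminal
node of $\tau'$. Let $t_i \in \tilde{\tau}'_t \cap N(\tau)$, for $i = 1, 2, \ldots, r$, where $r = |\tilde{\tau}'_t \cap N(\tau)|$. Then

$$\Delta_\beta(t, \tau') = \Delta_\beta(t, \tau) + \sum_{i=1}^{r} \big(\Delta_\beta(t, \tau) - \Delta_\beta(t_i, \tau)\big)\, \frac{|\tilde{\tau}_{t_i}| - 1}{|\tilde{\tau}'_t| - 1}. \qquad (4.1)$$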

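Proof:

$$R_\beta(t) - R_\beta(\tau'_t)$$
$$= R(t) - R(\tau'_t) - \big(\beta_1 W(\tau'_t) + \beta_2 (|\tilde{\tau}'_t| - 1)\big)$$
$$= R(t) - \Big(R(\tau_t) - \sum_{i=1}^{r} \big(R(\tau_{t_i}) - R(t_i)\big)\Big) - \beta_1 \Big(W(\tau_t) - \sum_{i=1}^{r} W(\tau_{t_i})\Big)$$
$$\quad - \beta_2 \Big((|\tilde{\tau}_t| - 1) - \sum_{i=1}^{r} (|\tilde{\tau}_{t_i}| - 1)\Big)$$
$$= R(t) - R(\tau_t) - \beta_1 W(\tau_t) - \beta_2 (|\tilde{\tau}_t| - 1)$$
$$\quad - \sum_{i=1}^{r} \big(R(t_i) - R(\tau_{t_i}) - \beta_1 W(\tau_{t_i}) - \beta_2 (|\tilde{\tau}_{t_i}| - 1)\big),$$

where the first equality follows from expressions (3.2), (3.3) and (3.7). Then, from
equation (3.13), we have

$$R_\beta(t) - R_\beta(\tau'_t) = \Delta_\beta(t, \tau)\,(|\tilde{\tau}_t| - 1) - \sum_{i=1}^{r} \Delta_\beta(t_i, \tau)\,(|\tilde{\tau}_{t_i}| - 1). \qquad (4.2)$$

Dividing both sides of (4.2) by $(|\tilde{\tau}'_t| - 1)$ gives

$$\Delta_\beta(t, \tau') = \Delta_\beta(t, \tau)\, \frac{|\tilde{\tau}_t| - 1}{|\tilde{\tau}'_t| - 1} - \sum_{i=1}^{r} \Delta_\beta(t_i, \tau)\, \frac{|\tilde{\tau}_{t_i}| - 1}{|\tilde{\tau}'_t| - 1}.$$

Since

$$|\tilde{\tau}_t| - 1 = (|\tilde{\tau}'_t| - 1) + \sum_{i=1}^{r} (|\tilde{\tau}_{t_i}| - 1),$$

the desired result follows. □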
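
Equation (4.1) can be illustrated on a small example. The following sketch assumes a made-up seven-node tree and penalty rates, represents a pruned subtree by the set of nodes it still splits (a hypothetical encoding, not the paper's), and compares $\Delta_\beta(t, \tau')$ computed directly with the right-hand side of (4.1).

```python
# Hypothetical toy tree: a splits into b and c; b into d, e; c into f, g.
children = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f", "g"]}
parent = {c: p for p, cs in children.items() for c in cs}
P = {"a": 1.0, "b": 0.6, "c": 0.4, "d": 0.3, "e": 0.3, "f": 0.1, "g": 0.3}
R = {"a": 0.5, "b": 0.25, "c": 0.2, "d": 0.05, "e": 0.1, "f": 0.02, "g": 0.15}
w = {"a": 1.0, "b": 1.0, "c": 0.5}   # assumed costs W_t at non-terminal nodes
beta1, beta2 = 0.05, 0.035           # assumed penalty rates


def path_cost(t):
    """W(t): total cost on the path from the root through par(t)."""
    total = 0.0
    while t in parent:
        t = parent[t]
        total += w[t]
    return total


def leaves_below(t, internal):
    """Terminal nodes below t in the tree whose split nodes are `internal`."""
    if t not in internal:
        return [t]
    return [x for c in children[t] for x in leaves_below(c, internal)]


def delta(t, internal):
    """Delta_beta(t, tree) computed as D_beta(t, tree) / (leaves below t - 1)."""
    leaves = leaves_below(t, internal)
    r_node = R[t] + beta1 * P[t] * path_cost(t) + beta2
    r_sub = sum(R[s] + beta1 * P[s] * path_cost(s) + beta2 for s in leaves)
    return (r_node - r_sub) / (len(leaves) - 1)


full = {"a", "b", "c"}      # tau: every split kept
pruned = {"a", "b"}         # tau': the split at c has been cut off
t = "a"

# The nodes t_i of the theorem: terminal in tau' below t, non-terminal in tau.
cut = [s for s in leaves_below(t, pruned) if s in full]
n_pruned = len(leaves_below(t, pruned)) - 1

lhs = delta(t, pruned)
rhs = delta(t, full) + sum(
    (delta(t, full) - delta(s, full)) * (len(leaves_below(s, full)) - 1) / n_pruned
    for s in cut
)
print(cut, lhs, rhs)
```

Here the only $t_i$ is node `c`, and the two sides agree up to floating-point rounding.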

Recall that $\beta_1 = 0$ in the CART method. The following corollary is immediate
from Theorem 4.2.

Corollary 4.1 Under the set-up of Theorem 4.2, if $\beta_1 = 0$, then

$$g_2(t, \tau') = g_2(t, \tau) + \sum_{i=1}^{r} \big(g_2(t, \tau) - g_2(t_i, \tau)\big)\, \frac{|\tilde{\tau}_{t_i}| - 1}{|\tilde{\tau}'_t| - 1}. \qquad (4.3)$$

From equation (4.3), we can determine the exact value of $g_2(t, \tau') - g_2(t, \tau)$, rather
than whether the inequality $g_2(t, \tau') - g_2(t, \tau) > 0$ is satisfied (see Theorem 10.11 of
Breiman et al. (1984)).

Given $\beta$ and a non-trivial tree $\tau$, we can get a sequence of pruned subtrees of $\tau$
together with corresponding numbers $\mu_i(\beta)$. Let $\tau_{(0)} = \tau$, and

$$\mu_0(\beta) = \min_{t \in N(\tau_{(0)})} \{\Delta_\beta(t, \tau_{(0)})\}.$$

We define a sequence of trees $\tau_{(i)}$ and the corresponding numbers $\mu_i(\beta)$, for $i =
1, 2, \ldots, w$ ($w$ a finite number) sequentially as follows:

Definition 4.1 Let $\tau_{(i)}$ be such that

$$N(\tau_{(i)}) = N(\tau_{(i-1)}) - \bigcup \{N(\tau_{(i-1)t});\ \Delta_\beta(t, \tau_{(i-1)}) = \mu_{i-1}(\beta)\} \qquad (4.4)$$

for $i = 1, 2, \ldots, w$, and then let

$$\mu_i(\beta) = \min_{t \in N(\tau_{(i)})} \{\Delta_\beta(t, \tau_{(i)})\}, \quad \text{for } i = 1, 2, \ldots, w. \qquad (4.5)$$

We can continue to apply equations (4.4) and (4.5) sequentially until we reach the
trivial tree (i.e., $root(\tau)$). Let $\tau_{(w)} = root(\tau)$.

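The following theorem is a big step towards the aim of establishing an algorithm
for an extended version of CART.

Theorem 4.3 For a non-trivial tree $\tau$, suppose $\tau_{(0)}, \tau_{(1)}, \ldots, \tau_{(w)} = root(\tau)$ are
obtained as in Definition 4.1. Then

(a) $\tau_{(0)} \succ \tau_{(1)} \succ \cdots \succ \tau_{(w)}$.

(b) For $t \in N(\tau_{(i)})$, $i = 1, 2, \ldots, w - 1$,

$$\Delta_\beta(t, \tau_{(i-1)}) = \Delta_\beta(t, \tau_{(i)}) \quad \text{if } \tau_{(i)t} = \tau_{(i-1)t},$$
$$\Delta_\beta(t, \tau_{(i-1)}) < \Delta_\beta(t, \tau_{(i)}) \quad \text{if } \tau_{(i)t} \prec \tau_{(i-1)t}. \qquad (4.6)$$

Proof: Part (a) follows directly from Definition 4.1. Let $t_1, t_2, \ldots, t_r \in \tilde{\tau}_{(i)t} \cap
N(\tau_{(i-1)})$. For the node $t$ in (b), we can think of two cases. They are (1) $\tau_{(i)t} =
\tau_{(i-1)t}$ and (2) $\tau_{(i)t} \prec \tau_{(i-1)t}$. In case (1), the result is immediate. In case (2), by
Definition 4.1, we have

$$\Delta_\beta(t, \tau_{(i-1)}) - \Delta_\beta(t_j, \tau_{(i-1)}) > 0, \quad \text{for } j = 1, 2, \ldots, r.$$

Therefore, part (b) follows from Theorem 4.2. □

In particular, if $\beta_1 = 0$, then we can say, by Corollary 4.1, under the condition
of Theorem 4.3, that

$$g_2(t, \tau_{(i-1)}) = g_2(t, \tau_{(i)}) \quad \text{if } \tau_{(i)t} = \tau_{(i-1)t},$$
$$g_2(t, \tau_{(i-1)}) < g_2(t, \tau_{(i)}) \quad \text{if } \tau_{(i)t} \prec \tau_{(i-1)t}. \qquad (4.7)$$

The result (4.7) is well harnessed in the CART method. However, when $\beta_1 \ne 0$,
the result (4.7) is of no use.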

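The following result, which is useful in finding the smallest OPST $\tau(\beta)$, is
immediate from Theorem 4.3.

Corollary 4.2 Let $\tau$ be a non-trivial tree. Suppose $\{\tau_{(i)}\}$ are obtained as
in Definition 4.1 for some $\beta$. Then

$$\mu_{i-1}(\beta) < \mu_i(\beta), \quad \text{for } i = 1, 2, \ldots, w - 1.$$

Proof:

$$\mu_i(\beta) = \min_{t \in N(\tau_{(i)})} \{\Delta_\beta(t, \tau_{(i)})\}$$
$$\ge \min_{t \in N(\tau_{(i)})} \{\Delta_\beta(t, \tau_{(i-1)})\} \quad \text{(by Theorem 4.3)}$$
$$> \min_{t \in N(\tau_{(i-1)})} \{\Delta_\beta(t, \tau_{(i-1)})\}$$
$$= \mu_{i-1}(\beta),$$

where the strict inequality holds because the nodes attaining $\mu_{i-1}(\beta)$ are removed
from $N(\tau_{(i-1)})$ in forming $N(\tau_{(i)})$. □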

The following result proves that the set of the pruned subtrees obtained as in
Definition 4.1 contains the smallest OPST.

Theorem 4.4 Let $\tau$ be a non-trivial tree. Suppose that we obtain $\{\tau_{(i)}\}$ as in
Definition 4.1 for some $\beta$. Then

$$\tau(\beta) \in \{\tau_{(0)}, \tau_{(1)}, \ldots, \tau_{(w)}\}.$$

Proof: For any $\tau'$ such that

$$\tau(\beta) \prec \tau' \preceq \tau_{(0)}, \qquad (4.8)$$

we have, by Theorem 4.1, that

$$\tau'(\beta) = \tau(\beta).$$

For any $\tau'$ satisfying expression (4.8),

$$\min_{t \in N(\tau')} \{\Delta_\beta(t, \tau')\} \le 0.$$

Otherwise, there must exist a subtree $\tau'$ satisfying expression (4.8) such that $R_\beta(\tau') <
R_\beta(\tau'(\beta)) = R_\beta(\tau(\beta))$, which is a contradiction.

By the definition of $\tau(\beta)$, we have that for every $t \in N(\tau(\beta))$,

$$\Delta_\beta(t, \tau(\beta)) > 0.$$

Therefore, we can find the smallest OPST $\tau(\beta)$ in the set of $\tau_{(0)}, \tau_{(1)}, \ldots, \tau_{(w)}$, which
is obtained in the process of Definition 4.1. □

Theorem 4.4 implies that we have only to look at $\tau_{(0)}, \tau_{(1)}, \ldots, \tau_{(w)}$ to find $\tau(\beta)$.
But $\{\mu_i(\beta)\}$ is more useful for finding $\tau(\beta)$ through the monotonicity of $\mu_i(\beta)$
as shown in Corollary 4.2.



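Theorem 4.5 Let $\tau$ be a non-trivial tree. Suppose $\{\tau_{(i)}\}$ are obtained as in
Definition 4.1 for some $\beta$. If, for some $i^*$, $1 \le i^* < w$,

$$\mu_{i^*-1}(\beta) \le 0 \quad \text{and} \quad \mu_{i^*}(\beta) > 0,$$

then

$$\tau(\beta) = \tau_{(i^*)}.$$

Proof: Suppose, for $i = 1, 2, \ldots, w$, there exist $t_1, t_2, \ldots, t_{r_i} \in \tilde{\tau}_{(i)} \cap N(\tau_{(i-1)})$. Then
by repeated use of Theorem 3.2 (a), we have

$$R_\beta(\tau_{(i)}) - R_\beta(\tau_{(i-1)}) = \sum_{j=1}^{r_i} \big(R_\beta(t_j) - R_\beta(\tau_{(i-1)t_j})\big)$$
$$= \mu_{i-1}(\beta) \sum_{j=1}^{r_i} (|\tilde{\tau}_{(i-1)t_j}| - 1),$$

where the last equation is apparent by Definition 4.1.
By Corollary 4.2, we have

$$\mu_i(\beta) < 0 \ \text{ for } i \le i^* - 2, \quad \text{and} \quad \mu_i(\beta) > 0 \ \text{ for } i \ge i^*.$$

Thus

$$R_\beta(\tau_{(i+1)}) - R_\beta(\tau_{(i)}) \ \begin{cases} < 0 & \text{for } i \le i^* - 2 \\ > 0 & \text{for } i \ge i^*. \end{cases}$$

Furthermore

$$R_\beta(\tau_{(i^*)}) - R_\beta(\tau_{(i^*-1)}) \ \begin{cases} = 0 & \text{if } \mu_{i^*-1}(\beta) = 0 \\ < 0 & \text{if } \mu_{i^*-1}(\beta) < 0. \end{cases}$$

Therefore,

$$R_\beta(\tau_{(i^*)}) = \min_{0 \le i \le w} R_\beta(\tau_{(i)}),$$

with $\tau_{(i^*)}$ the smallest tree in the sequence attaining the minimum, and the result
follows from Theorem 4.4. □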



We now have the following summarizing result.

Theorem 4.6 Let $\tau$ be a non-trivial tree. Suppose $\{\tau_{(i)}\}$ are obtained as in
Definition 4.1 for some $\beta$. Then

$$\tau(\beta) = \begin{cases} \tau_{(k+1)} & \text{if } \mu_k(\beta) = 0 \\ \tau_{(k)} & \text{if } \mu_k(\beta) > 0 \text{ and } \mu_{k-1}(\beta) < 0, \text{ for } k \ge 1 \\ \tau_{(0)} & \text{if } \mu_0(\beta) > 0. \end{cases}$$

Proof: If $\mu_0(\beta) > 0$, then the result is obvious. If $\mu_k(\beta) = 0$, then by Corollary
4.2, it is immediate that

$$\mu_i(\beta) < 0 \quad \text{for } i < k$$

and

$$\mu_i(\beta) > 0 \quad \text{for } i \ge k + 1.$$

Thus, by Theorem 4.5, $\tau(\beta) = \tau_{(k+1)}$. By the same theorem, we can see that
$\tau(\beta) = \tau_{(k)}$, when $\mu_k(\beta) > 0$ and $\mu_{k-1}(\beta) < 0$, for $k \ge 1$. □
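
The pruning sequence of Definition 4.1 and the selection rule of Theorem 4.6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's program: the toy tree and all numbers are invented, a pruned subtree is encoded as the set of nodes it still splits, ties are detected with a small floating-point tolerance, and the rule "take the first tree in the sequence whose $\mu_i(\beta)$ is positive, or the trivial tree if there is none" is used as a restatement of Theorem 4.6 that relies on the monotonicity in Corollary 4.2. The result is compared with a brute-force search over all pruned subtrees.

```python
from itertools import combinations

# Hypothetical toy tree: a splits into b and c; b into d, e; c into f, g.
children = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f", "g"]}
parent = {c: p for p, cs in children.items() for c in cs}
P = {"a": 1.0, "b": 0.6, "c": 0.4, "d": 0.3, "e": 0.3, "f": 0.1, "g": 0.3}
R = {"a": 0.5, "b": 0.25, "c": 0.2, "d": 0.05, "e": 0.1, "f": 0.02, "g": 0.15}
w = {"a": 1.0, "b": 1.0, "c": 0.5}   # assumed costs W_t at non-terminal nodes
beta1, beta2 = 0.05, 0.03            # assumed penalty rates
ROOT = "a"


def path_cost(t):
    """W(t): total cost on the path from the root through par(t)."""
    total = 0.0
    while t in parent:
        t = parent[t]
        total += w[t]
    return total


def leaves_below(t, internal):
    """Terminal nodes below t in the tree whose split nodes are `internal`."""
    if t not in internal:
        return [t]
    return [x for c in children[t] for x in leaves_below(c, internal)]


def node_risk(t):
    """R_beta(t) of (3.2)."""
    return R[t] + beta1 * P[t] * path_cost(t) + beta2


def tree_risk(internal):
    """R_beta of the pruned subtree encoded by `internal`, as in (3.1)."""
    return sum(node_risk(s) for s in leaves_below(ROOT, internal))


def delta(t, internal):
    """Delta_beta(t, tree) = D_beta(t, tree) / (leaves below t - 1)."""
    leaves = leaves_below(t, internal)
    return (node_risk(t) - sum(node_risk(s) for s in leaves)) / (len(leaves) - 1)


def split_nodes_below(t, internal):
    """N(tree_t): t together with every split node beneath it."""
    out = {t}
    for c in children[t]:
        if c in internal:
            out |= split_nodes_below(c, internal)
    return out


def pruning_sequence(internal, tol=1e-12):
    """Trees tau_(0), ..., tau_(w) and numbers mu_0, ..., mu_(w-1)."""
    current, trees, mus = set(internal), [set(internal)], []
    while current:
        d = {t: delta(t, current) for t in current}
        mu = min(d.values())
        cut = set().union(*(split_nodes_below(t, current)
                            for t in d if abs(d[t] - mu) < tol))
        current = current - cut
        mus.append(mu)
        trees.append(set(current))
    return trees, mus


def smallest_opst(trees, mus):
    """First tree whose mu is positive; the trivial tree if none is."""
    for tree, mu in zip(trees, mus):
        if mu > 0:
            return tree
    return trees[-1]


trees, mus = pruning_sequence(set(children))
chosen = smallest_opst(trees, mus)

# Brute force: every parent-closed set of split nodes is a pruned subtree.
candidates = [set(c) for k in range(len(children) + 1)
              for c in combinations(sorted(children), k)
              if all(t == ROOT or parent[t] in c for t in c)]
best = min(candidates, key=lambda s: (round(tree_risk(s), 12), len(s)))
print(mus, sorted(chosen), sorted(best))
```

With these numbers the sequence has $w = 2$, $\mu_0(\beta) < 0 < \mu_1(\beta)$, and both the selection rule and the exhaustive search return the tree that keeps the splits at `a` and `b` and cuts the split at `c`.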

Theorem 4.6 is the main result of this section and the paper. For a given $\beta$,
however, it is by no means desirable to grow a tree far beyond the optimal tree $\tau(\beta)$
before pruning up until $\tau(\beta)$ is reached. Theorem 4.6 is available whenever the tree
$\tau$ thereof contains $\tau(\beta)$. At the end of Section 10.2, Breiman et al. (1984) discuss
a method by which one can find a tree which may not be fully grown involving all
the possible predictor variables and which contains $\tau(\beta)$. The following results up
to Theorem 4.9 are straightforward extensions of Theorem 10.31, Theorem 10.32,
and the subsequent paragraphs in Section 10.2 of Breiman et al. (1984) and thus
their proofs will be omitted.

Let $node(s, t)$ denote the set of nodes on the path from node $s$ through node $t$
and $l(s, t)$ the number of connections on the same path, i.e., $l(s, t) = |node(s, t)| - 1$.
For $t \in N(\tau) \cup \tilde{\tau}$, denote by $anc(t, \tau)$ the set of all the ancestors of node $t$ in the
tree $\tau$. Define, for a non-terminal node $t$ of a non-trivial tree $\tau$,

$$V_\beta(t) = \min_{s \in anc(t) \cup \{t\}} \{R(s) - \beta_1 W(s, t) - \beta_2 (l(s, t) + 1)\},$$

where

$$W(s, t) = \sum_{u \in node(s, t)} P(u) W_u.$$

Theorem 4.7 Let $\tau$ be a non-trivial tree. Then,

$$V_\beta(t) > 0, \quad \forall t \in N(\tau(\beta)).$$

If we define

$$\tau_{suff}(\beta) = \{t \in N(\tau) \cup \tilde{\tau};\ V_\beta(s) > 0,\ \forall s \in anc(t)\}, \qquad (4.9)$$

then we have the following result.

Theorem 4.8 For a given $\beta$,

$$\tau(\beta) \preceq \tau_{suff}(\beta).$$

The tree $\tau_{suff}(\beta)$ contains $\tau(\beta)$, so we don't need to go beyond $\tau_{suff}(\beta)$ before
starting pruning toward $\tau(\beta)$.

$V_\beta$ can be defined recursively as in the theorem below.

Theorem 4.9 For any non-terminal node $t$ of a tree $\tau$ and a non-negative vector
$\beta$,

$$V_\beta(t) = \min\{R(t), V_\beta(par(t))\} - \beta_1 P(t) W_t - \beta_2.$$
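
The recursion of Theorem 4.9 can be compared with the direct definition of $V_\beta$ on a small example. In the sketch below, the toy tree, its numbers, and the function names are illustrative assumptions. One further assumption is made for the root, where $par(t)$ does not exist: the recursion is started from $R(t) - \beta_1 P(t) W_t - \beta_2$, which is what the direct definition gives when the only admissible $s$ is $t$ itself.

```python
# Hypothetical toy tree: a splits into b and c; b into d, e; c into f, g.
children = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f", "g"]}
parent = {c: p for p, cs in children.items() for c in cs}
P = {"a": 1.0, "b": 0.6, "c": 0.4, "d": 0.3, "e": 0.3, "f": 0.1, "g": 0.3}
R = {"a": 0.5, "b": 0.25, "c": 0.2, "d": 0.05, "e": 0.1, "f": 0.02, "g": 0.15}
w = {"a": 1.0, "b": 1.0, "c": 0.5}   # assumed costs W_t at non-terminal nodes
beta1, beta2 = 0.05, 0.03            # assumed penalty rates


def ancestors(t):
    """Strict ancestors of t, nearest first."""
    out = []
    while t in parent:
        t = parent[t]
        out.append(t)
    return out


def path_nodes(s, t):
    """node(s, t): nodes on the path from s down to t, both ends included."""
    path = [t]
    while path[-1] != s:
        path.append(parent[path[-1]])
    return path


def v_direct(t):
    """V_beta(t) from its definition as a minimum over t and its ancestors."""
    values = []
    for s in [t] + ancestors(t):
        nodes = path_nodes(s, t)
        cost = sum(P[u] * w[u] for u in nodes)     # W(s, t)
        values.append(R[s] - beta1 * cost - beta2 * len(nodes))  # l(s,t)+1 nodes
    return min(values)


def v_recursive(t):
    """V_beta(t) from the recursion of Theorem 4.9 (root handled as assumed)."""
    base = R[t] if t not in parent else min(R[t], v_recursive(parent[t]))
    return base - beta1 * P[t] * w[t] - beta2


V = {t: v_recursive(t) for t in children}

# Nodes of the sufficient tree (4.9): every strict ancestor has V_beta > 0.
sufficient = sorted(t for t in P if all(V[s] > 0 for s in ancestors(t)))
print(V, sufficient)
```

For these values $V_\beta$ is positive at every non-terminal node, so the sufficient tree is the whole toy tree, and the recursive and direct computations agree.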

5 Concluding Remarks 

In CART, Breiman et al. (1984) prune trees using as criterion a linear combination
of the risk of the predictions and the total number of the terminal nodes. Here, we
have extended the CART pruning algorithm, in the sense that we consider the cost
of the variable observation in addition to the factors used in CART's pruning
criterion, and we have derived results useful for a pruning algorithm under this new
criterion. CART's pruning algorithm can thus be viewed as a special case ($\beta_1 = 0$
in expression (3.1)) of the algorithm considered in this paper.

Equation (4.1) of Theorem 4.2 plays a key role in deriving the pruning algorithm.
It is a useful algebraic tool in dealing with functions defined on trees. Versions of this
equation would be possible under various pruning criteria. For example, Arbab and
Michie (1985) considered degree of linearity of a tree as a measure of desirability for
trees. Their degree of linearity is represented in terms of the non-linearity measure
which is defined as follows:

Let $\tau$ be a tree which is composed of the root node and $m$ major
subtrees, $\tau_1, \tau_2, \ldots, \tau_m$ as in Fig. 5.1. Then the non-linearity of the tree
$\tau$ is given by

$$NL(\tau) = \frac{1}{[\text{illegible}]} \times \sum_{i=1}^{m} \{NL(\tau_i) + (m - i) \times |N(\tau_i)|\},$$

where $NL(\tau) = 0$ if $\tau$ is trivial, and $N(\tau_i)$ are sorted in increasing order
of $|N(\tau_i)|$.

Given a tree, the non-linearity of the tree is defined over the set of the major
subtrees of the tree. And thus at each non-terminal node $t$, say, of the tree we can
assign a non-linearity measure of the subtree whose root node is $t$. If we considered
this linearity of a tree in addition to the cost function considered in this paper, our
pruning method would be more complicated than the present one. This further
extended version of the pruning method seems to be an interesting problem to
pursue.

{Figure 5.1 about here.}